Fix C tokenizer NUL byte truncation #358
Open
kimjune01 wants to merge 4 commits into
Conversation
The C tokenizer was silently truncating input at the first NUL byte (`\x00`) because it used `'\0'` both as a valid input character and as the EOF sentinel.

Root cause:
- `Tokenizer_read()` returned `'\0'` for both:
  1. Real NUL bytes in the input
  2. End-of-input (when `index >= text.length`)
- This made them indistinguishable, causing real NULs to be treated as EOF.

Fix:
1. Define `TOKENIZER_EOF` as `0x110000` (the first invalid Unicode code point)
2. Update `Tokenizer_read()` and `Tokenizer_read_backwards()` to return `TOKENIZER_EOF` instead of `'\0'` for out-of-bounds reads
3. Replace all `!this` and `'\0'` checks with explicit `TOKENIZER_EOF` checks
4. Remove `'\0'` from the `MARKERS` array (no longer needed as an EOF marker)
5. Move the EOF check before `is_marker()` in the main parse loop so `TOKENIZER_EOF` is never emitted as a character
6. Fix `Tokenizer_has_leading_whitespace()` to recognize `TOKENIZER_EOF`

The Python tokenizer already preserved NUL bytes correctly; this brings the C tokenizer into parity.

Regression test added: `test_nul_byte_preservation()` verifies that both tokenizers now preserve NUL bytes in plain text, templates, and multiple-NUL scenarios.
Four start-of-input checks in the main parse loop still used `!last` (falsy NUL) to detect beginning-of-input. After the `TOKENIZER_EOF` sentinel change, `Tokenizer_read_backwards` returns `0x110000` instead of `'\0'`, so `!last` is always false and headings, lists, and horizontal rules at position 0 would silently fail to parse.
`Tokenizer_read` returns `TOKENIZER_EOF` for end-of-input, not a falsy value; the old truthiness check let NUL bytes truncate parsing.
Two bugs introduced by the NUL truncation fix:

1. `Tokenizer_handle_invalid_tag_start`: the tag-name scanning loop checked `is_marker()` and `Py_UNICODE_ISSPACE()` to terminate, but `TOKENIZER_EOF` (`0x110000`) matches neither, causing an infinite loop when an incomplete closing tag like `</ref` reaches EOF without `>`.
2. `Tokenizer_parse` colon handling: the bare external link check `this == ':' && !is_marker(last)` fired at start-of-input because `is_marker(TOKENIZER_EOF)` returns false (`0x110000` is not in `MARKERS`). This intercepted `:` before the list handler could run, breaking definition list items like `:text` → `<dd>`.
Summary
The C tokenizer used `'\0'` both for real NUL bytes in the input and for end-of-input, causing silent truncation at the first NUL byte. This patch introduces a `TOKENIZER_EOF` sentinel (`0x110000`, the first value outside the valid Unicode range) and replaces all `!this`/`!last` truthiness checks with explicit `== TOKENIZER_EOF` comparisons. It also removes `'\0'` from the `MARKERS` array, since NUL is no longer a special marker.

Test plan
`pytest tests/test_tokenizer.py::test_nul_byte_preservation` passes for both CTokenizer and PyTokenizer.